Abstract

We propose one-class support measure ma- 
chines (OCSMMs) for group anomaly detec- 
tion. Unlike traditional anomaly detection, 
OCSMMs aim at recognizing anomalous ag- 
gregate behaviors of data points. The OC- 
SMMs generalize well-known one-class sup- 
port vector machines (OCSVMs) to a space 
of probability measures. By formulating the 
problem as quantile estimation on distribu- 
tions, we can establish interesting connec- 
tions to the OCSVMs and variable kernel 
density estimators (VKDEs) over the input 
space on which the distributions are defined, 
bridging the gap between large-margin meth- 
ods and kernel density estimators. In partic- 
ular, we show that various types of VKDEs 
can be considered as solutions to a class of 
regularization problems studied in this paper. Experiments on the Sloan Digital Sky Survey dataset and a High Energy Particle Physics dataset demonstrate the benefits of the proposed framework in real-world applications.



1 Introduction 

Anomaly detection is one of the most important tools 
in all data-driven scientific disciplines. Data that do 
not conform to the expected behaviors often bear some 
interesting characteristics and can help domain experts 
better understand the problem at hand. However, in 
the era of data explosion, the anomaly may appear 
not only in the data themselves, but also as a result 
of their interactions. The main objective of this paper 
is to investigate the latter type of anomalies.

Figure 1: An illustration of two types of group anomalies. An anomalous group may be a group of anomalous samples, which is easy to detect (unfilled points). In this paper, we are interested in detecting anomalous groups of normal samples (filled points), which are more difficult to detect because of the higher-order statistics. Note that the group anomalies we are interested in can only be observed in the space of distributions.

To be consistent with previous works (Poczos et al. 2011, Xiong et al. 2011a,b), we will refer to this problem as group anomaly detection, as opposed to traditional point anomaly detection. Like traditional point anomaly detection, group anomaly detection refers to the problem of finding patterns in groups of data that do not conform to expected behaviors (Poczos et al. 2011, Xiong et al. 2011a,b).



That is, an ultimate goal is to detect interesting ag- 
gregate behaviors of data points among several groups. 
In principle, anomalous groups may consist of individ- 
ually anomalous points, which are relatively easy to 
detect. On the other hand, anomalous groups of relatively normal points, whose behavior as a group is unusual, are much more difficult to detect. In this work, we are interested in the latter type of group anomalies. Figure 1 illustrates this scenario.

Group anomaly detection may shed light on a wide range of applications. For example, the Sloan Digital Sky Survey (SDSS) has produced a tremendous amount of astronomical data. It is therefore crucial to detect rare objects such as stars, galaxies, or quasars that might lead to a scientific discovery. In addition to individual celestial objects, investigating groups of them may help astronomers understand the universe on larger scales. For instance, anomalous groups of galaxies, which are the smallest aggregates of galaxies, may reveal interesting phenomena, e.g., the gravitational interactions of galaxies.

Likewise, new physical phenomena in high energy particle physics, such as the Higgs boson, appear as tiny excesses of certain types of collision events among a vast background of known physics in particle detectors (Bhat 2011, Vatanen et al. 2012). Investigating each
collision event individually is no longer sufficient as the 
individual events may not be anomalies by themselves, 
but their occurrence together as a group is anomalous. 
Hence, we need a powerful algorithm to detect such a 
rare and highly structured anomaly. 

Lastly, the algorithm proposed in this paper can be 
applied to point anomaly detection with substantial 
and heterogeneous uncertainties. For example, it is 
often costly and time-consuming to obtain the full 
spectra of astronomical objects. Instead, relatively 
noisier measurements are usually made. In addition, 
the estimated uncertainty which represents the uncer- 
tainty one would obtain from multiple observations is 
also available. Incorporating these uncertainties has been shown to improve the performance of the learning systems (Bovy et al. 2011, Kirkpatrick et al. 2011, Ross et al. 2012).



Anomaly detection has been intensively studied (Chandola et al. (2009) and references therein). However, few attempts have been made at developing successful group anomaly detection algorithms. For example, a straightforward approach is to define a set of features for each group and apply standard point anomaly detection (Chan and Mahoney 2005). Despite its simplicity, this approach requires specific domain knowledge to construct appropriate sets of features. Another possibility is to first identify the individually anomalous points and then find their aggregations (Das et al. 2008). Again, this approach relies only on the detection of anomalous points and thus cannot find anomalous groups whose members are perfectly normal. Successful group anomaly detectors should be able to incorporate the higher-order statistics of the groups.

Recently, a family of hierarchical probabilistic models based on Latent Dirichlet Allocation (LDA) (Blei et al. 2003) has been proposed to cope with both types of group anomalies (Xiong et al. 2011a,b). In these models, the data points in each group are assumed to be one of K different types and generated by a mixture of K Gaussian distributions. Although the distributions over these K types can vary across M groups, they share a common generator. The groups that have small probabilities under the model are marked as anomalies using scoring criteria defined as a combination of a point-based anomaly score and a group-based anomaly score. The Flexible Genre Model (FGM) recently extended this idea to model more complex group structures (Xiong et al. 2011a).



Instead of employing a generative approach, we propose a simple and efficient discriminative way of detecting group anomalies. In this work, M groups of data points are represented by a set of M probability distributions assumed to be i.i.d. realizations of some unknown distribution $\mathcal{P}$. In practice, only i.i.d. samples from these distributions are observed. Hence, we can
treat group anomaly detection as detecting the anoma- 
lous distributions based on their empirical samples. 
To allow for a practical algorithm, the distributions 
are mapped into the reproducing kernel Hilbert space 
(RKHS) using the kernel mean embedding. By work- 
ing directly with the distributions, the higher-order in- 
formation arising from the aggregate behaviors of the 
data points can be incorporated efficiently. 

2 Quantile Estimation on Probability 
Distributions 

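Let $\mathcal{X}$ denote a non-empty input space with associated $\sigma$-algebra $\mathcal{A}$, $\mathbb{P}$ denote a probability distribution on $(\mathcal{X}, \mathcal{A})$, and $\mathfrak{P}_{\mathcal{X}}$ denote the set of all probability distributions on $(\mathcal{X}, \mathcal{A})$. The space $\mathfrak{P}_{\mathcal{X}}$ is endowed with the topology of weak convergence and the associated Borel $\sigma$-algebra.

We assume that there exists a distribution $\mathcal{P}$ on $\mathfrak{P}_{\mathcal{X}}$, where $\mathbb{P}_1, \ldots, \mathbb{P}_\ell$ are i.i.d. realizations from $\mathcal{P}$, and the sample $S_i$ is made of $n_i$ i.i.d. samples distributed according to the distribution $\mathbb{P}_i$. In this work, we observe $\ell$ samples $S_i = \{x_k^{(i)}\}_{1 \le k \le n_i}$ for $i = 1, \ldots, \ell$. For each sample $S_i$, $\hat{\mathbb{P}}_i = \frac{1}{n_i} \sum_{k=1}^{n_i} \delta_{x_k^{(i)}}$ is the associated empirical distribution of $\mathbb{P}_i$.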



In this work, we formulate the group anomaly detection problem as learning a quantile function $q : \mathfrak{P}_{\mathcal{X}} \to \mathbb{R}$ to estimate the support of $\mathcal{P}$. Let $\mathcal{C}$ be a class of measurable subsets of $\mathfrak{P}_{\mathcal{X}}$ and $\lambda$ be a real-valued function defined on $\mathcal{C}$. The quantile function w.r.t. $(\mathcal{P}, \mathcal{C}, \lambda)$ is

$$q(\beta) = \inf\{\lambda(C) : \mathcal{P}(C) \ge \beta,\ C \in \mathcal{C}\},$$

where $0 < \beta \le 1$. In this paper, we consider the case where $\lambda$ is the Lebesgue measure, in which case $C(\beta)$ is the minimum volume $C \in \mathcal{C}$ that contains at least a fraction $\beta$ of the probability mass of $\mathcal{P}$. Thus, the function $q$ can be used to test whether any test distribution $\mathbb{P}_t$ is anomalous w.r.t. the training distributions.

Rather than estimating $C(\beta)$ in the space of distributions directly, we first map the distributions into a feature space via a positive semi-definite kernel $k$. Our class $\mathcal{C}$ is then implicitly defined as the set of half-spaces in the feature space. Specifically, $C_w = \{\mathbb{P} \mid f_w(\mathbb{P}) \ge \rho\}$ where $(w, \rho)$ are respectively a weight vector and an offset parametrizing a hyperplane in the feature space associated with the kernel $k$. The optimal $(w, \rho)$ is obtained by minimizing a regularizer which controls the smoothness of the estimated function describing $C$.

3 One-Class Support Measure 
Machines 

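In order to work with the probability distributions efficiently, we represent the distributions as mean functions in a reproducing kernel Hilbert space (RKHS) (Berlinet and Agnan 2004, Smola et al. 2007). Formally, let $\mathcal{H}$ denote an RKHS of functions $f : \mathcal{X} \to \mathbb{R}$ with reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The kernel mean map from $\mathfrak{P}_{\mathcal{X}}$ into $\mathcal{H}$ is defined as

$$\mu : \mathfrak{P}_{\mathcal{X}} \to \mathcal{H}, \quad \mathbb{P} \mapsto \int_{\mathcal{X}} k(x, \cdot)\, d\mathbb{P}(x). \tag{1}$$

We assume that $k(x, \cdot)$ is bounded for any $x \in \mathcal{X}$. For any $\mathbb{P}$, letting $\mu_{\mathbb{P}} = \mu(\mathbb{P})$, one can show that $\mathbb{E}_{\mathbb{P}}[f] = \langle \mu_{\mathbb{P}}, f \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.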



The following theorem due to Fukumizu et al. (2004) and Sriperumbudur et al. (2010) gives a promising property of representing distributions as mean elements in the RKHS.

Theorem 1. The kernel $k$ is characteristic if and only if the map (1) is injective.

Examples of characteristic kernels include the Gaussian RBF kernel and the Laplace kernel. Using a characteristic kernel $k$, Theorem 1 implies that the map (1) preserves all information about the distributions. Hence, one can apply many existing kernel-based learning algorithms to the distributions as if they were individual samples, with no information loss.

Intuitively, one may view the mean embeddings of the 
distributions as their feature representations. Thus, 
our approach is in line with previous attempts in group 
anomaly detection that find a set of appropriate features for each group. On the one hand, however, the mean embedding approach captures all necessary information about the groups without relying heavily on specific domain knowledge. On the other hand, it leaves the flexibility to choose the feature representation that is suitable to the problem at hand via the choice of the kernel $k$.

3.1 OCSMM Formulation 

Using the mean embedding representation (1), the primal optimization problem for the one-class SMM can be formulated in an analogous way to the one-class SVM (Scholkopf et al. 2001) as follows:

$$\underset{w,\, \xi,\, \rho}{\text{minimize}} \quad \frac{1}{2} \langle w, w \rangle_{\mathcal{H}} - \rho + \frac{1}{\nu \ell} \sum_{i=1}^{\ell} \xi_i \tag{2a}$$

$$\text{subject to} \quad \langle w, \mu_{\mathbb{P}_i} \rangle_{\mathcal{H}} \ge \rho - \xi_i, \quad \xi_i \ge 0 \tag{2b}$$

where $\xi_i$ denote slack variables and $\nu \in (0, 1]$ is a trade-off parameter corresponding to an expected fraction of outliers within the feature space. The trade-off $\nu$ is an upper bound on the fraction of outliers and a lower bound on the fraction of support measures (Scholkopf et al. 2001).



The trade-off parameter $\nu$ plays an important role in group anomaly detection. A small $\nu$ implies that anomalous groups are rare compared to the normal groups. If $\nu$ is too small, some anomalous groups go undetected. On the other hand, a large $\nu$ implies that anomalous groups are common. If $\nu$ is too large, some normal groups are flagged as anomalous. As group anomalies are subtle, one needs to choose $\nu$ very carefully to reduce the effort required to interpret the results.

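By introducing Lagrange multipliers $\alpha$, we have $w = \sum_{i=1}^{\ell} \alpha_i \mu_{\mathbb{P}_i} = \sum_{i=1}^{\ell} \alpha_i \mathbb{E}_{\mathbb{P}_i}[k(x, \cdot)]$ and the dual form of (2) can be written as

$$\underset{\alpha}{\text{minimize}} \quad \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j \langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \rangle_{\mathcal{H}} \tag{3a}$$

$$\text{subject to} \quad 0 \le \alpha_i \le \frac{1}{\nu \ell}, \quad \sum_{i=1}^{\ell} \alpha_i = 1. \tag{3b}$$

Note that the dual form is a quadratic program and depends only on the inner products $\langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \rangle_{\mathcal{H}}$. Given that we can compute $\langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \rangle_{\mathcal{H}}$, we can employ standard QP solvers to solve (3).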
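To make the structure of the dual concrete, the following is a minimal sketch of a solver for a problem of this form: minimize one half of the quadratic form in the Gram matrix, subject to box constraints with upper bound 1/(νℓ) and coefficients summing to one. It is not the solver used in the paper. It is a simple pairwise (SMO-style) coordinate update written with the standard library only, and the function name `solve_ocsmm_dual` and its parameters are assumptions made for illustration.

```python
def solve_ocsmm_dual(K, nu, tol=1e-10, max_iter=100000):
    """Illustrative solver (not the paper's): minimise 0.5 * a^T K a
    subject to 0 <= a_i <= 1/(nu * l) and sum(a) = 1."""
    l = len(K)
    upper = 1.0 / (nu * l)
    alpha = [1.0 / l] * l                      # feasible start because nu <= 1
    grad = [sum(K[i][j] * alpha[j] for j in range(l)) for i in range(l)]
    eps = 1e-12
    for _ in range(max_iter):
        can_grow = [i for i in range(l) if alpha[i] < upper - eps]
        can_shrink = [j for j in range(l) if alpha[j] > eps]
        if not can_grow or not can_shrink:
            break
        i = min(can_grow, key=lambda t: grad[t])     # steepest feasible increase
        j = max(can_shrink, key=lambda t: grad[t])   # steepest feasible decrease
        gap = grad[j] - grad[i]
        if gap < tol:                                # KKT conditions hold up to tol
            break
        curvature = max(K[i][i] + K[j][j] - 2.0 * K[i][j], eps)
        step = min(gap / curvature, upper - alpha[i], alpha[j])
        alpha[i] += step                             # moving mass from j to i keeps sum(a) = 1
        alpha[j] -= step
        for t in range(l):
            grad[t] += step * (K[t][i] - K[t][j])
    return alpha
```

With ν = 1 the box constraint forces every coefficient to equal 1/ℓ, so the routine returns the uniform weights immediately. For ν < 1, the offset ρ can be read off as the common value of the gradient entries whose coefficients lie strictly inside the box.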

3.2 Kernels on Probability Distributions 

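From (3), we can see that $\mu_{\mathbb{P}}$ is a feature map associated with the kernel $K : \mathfrak{P}_{\mathcal{X}} \times \mathfrak{P}_{\mathcal{X}} \to \mathbb{R}$, defined as $K(\mathbb{P}_i, \mathbb{P}_j) = \langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \rangle_{\mathcal{H}}$. It follows from Fubini's theorem and the reproducing property of $\mathcal{H}$ that

$$\langle \mu_{\mathbb{P}_i}, \mu_{\mathbb{P}_j} \rangle_{\mathcal{H}} = \iint \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}}\, d\mathbb{P}_i(x)\, d\mathbb{P}_j(y) = \iint k(x, y)\, d\mathbb{P}_i(x)\, d\mathbb{P}_j(y). \tag{4}$$

Hence, $K$ is a positive definite kernel on $\mathfrak{P}_{\mathcal{X}}$. Given the sample sets $S_1, \ldots, S_\ell$, one can estimate (4) by

$$K(\hat{\mathbb{P}}_i, \hat{\mathbb{P}}_j) = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} k(x_k^{(i)}, x_l^{(j)}) \tag{5}$$

where $x_k^{(i)} \in S_i$, $x_l^{(j)} \in S_j$, and $n_i$ is the number of samples in $S_i$ for $i = 1, \ldots, \ell$.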

Previous works in kernel-based anomaly detection have shown that the Gaussian RBF kernel is more suitable than some other kernels such as polynomial kernels (Hoffmann 2007). Thus we will focus primarily on the Gaussian RBF kernel given by

$$k_\sigma(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right), \quad x, x' \in \mathcal{X} \tag{6}$$

where $\sigma > 0$ is a bandwidth parameter. In the sequel, we denote the reproducing kernel Hilbert space associated with the kernel $k_\sigma$ by $\mathcal{H}_\sigma$. Also, let $\Phi : \mathcal{X} \to \mathcal{H}_\sigma$ be a feature map such that $k_\sigma(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}_\sigma}$.
In group anomaly detection, we always observe the i.i.d. samples from the distribution underlying the group. Thus, it is natural to use the empirical kernel (5). However, one may relax this assumption and apply the kernel (4) directly. For instance, if we have Gaussian distributions $\mathbb{P}_i = \mathcal{N}(m_i, \Sigma_i)$ and a Gaussian RBF kernel $k_\sigma$, we can compute the kernel analytically by

$$K(\mathbb{P}_i, \mathbb{P}_j) = \frac{\exp\left(-\frac{1}{2} (m_i - m_j)^\top B^{-1} (m_i - m_j)\right)}{\left|\frac{1}{\sigma^2} \Sigma_i + \frac{1}{\sigma^2} \Sigma_j + I\right|^{1/2}} \tag{7}$$

where $B = \Sigma_i + \Sigma_j + \sigma^2 I$. This kernel is particularly useful when one wants to incorporate the point-wise uncertainty of the observation into the learning algorithm (Muandet et al. 2012). More details will be given in Sections 4.2 and 5.
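For intuition, the closed form for Gaussian distributions is easy to evaluate in the isotropic case, where each covariance is a scalar variance times the identity. The sketch below assumes that special case; the function name and argument names are hypothetical. The normalizing factor is the one obtained by evaluating the double integral in (4) for two Gaussians and a Gaussian RBF kernel.

```python
import math

def gaussian_embedding_kernel(m_i, var_i, m_j, var_j, sigma):
    """Kernel between N(m_i, var_i * I) and N(m_j, var_j * I) under a
    Gaussian RBF kernel with bandwidth sigma (isotropic special case)."""
    d = len(m_i)
    b = var_i + var_j + sigma ** 2           # B = b * I in the isotropic case
    sq_dist = sum((a - c) ** 2 for a, c in zip(m_i, m_j))
    # |(var_i + var_j) / sigma^2 * I + I|^(-1/2) = (sigma^2 / b)^(d / 2)
    normaliser = (sigma ** 2 / b) ** (d / 2.0)
    return normaliser * math.exp(-0.5 * sq_dist / b)
```

When both variances are zero the distributions collapse to point masses and the expression reduces to the ordinary RBF kernel on the means. Nonzero variances widen the effective bandwidth from σ² to σ² plus the two variances.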

4 Theoretical Analysis 

This section presents some theoretical analyses. The geometrical interpretation of OCSMMs is given in Section 4.1. Then, we discuss the connection of the OCSMM to the kernel density estimator in Section 4.2. In the sequel, we will focus on translation-invariant kernel functions to simplify the analysis.

4.1 Geometric Interpretation 

For a translation-invariant kernel, $k(x, x)$ is constant for all $x \in \mathcal{X}$. That is, $\|\Phi(x)\|_{\mathcal{H}} = \tau$ for some constant $\tau$. This implies that all of the images $\Phi(x)$ lie on a sphere in the feature space (cf. Figure 2a). Consequently, the following inequality holds

$$\|\mu_{\mathbb{P}}\|_{\mathcal{H}} = \left\| \int k(x, \cdot)\, d\mathbb{P}(x) \right\|_{\mathcal{H}} \le \int \|k(x, \cdot)\|_{\mathcal{H}}\, d\mathbb{P}(x) = \tau,$$

which shows that all mean embeddings lie inside the sphere (cf. Figure 2a). As a result, we can establish the existence and uniqueness of the separating hyperplane $w$ in (2) through the following theorem.



Theorem 2. There exists a unique separating hyperplane $w$ as a solution to (2) that separates $\mu_{\mathbb{P}_1}, \mu_{\mathbb{P}_2}, \ldots, \mu_{\mathbb{P}_\ell}$ from the origin.

Proof. Due to the separability of the feature maps $\Phi(x)$, the convex hull of the mean embeddings $\mu_{\mathbb{P}_1}, \mu_{\mathbb{P}_2}, \ldots, \mu_{\mathbb{P}_\ell}$ does not contain the origin. The existence and uniqueness of the hyperplane then follows from the supporting hyperplane theorem (Scholkopf and Smola 2001). ■



By Theorem 2, the OCSMM is a simple generalization
of OCSVM to the space of probability distributions. 
Furthermore, the straightforward generalization will 
allow for a direct application of an efficient learning 
algorithm as well as existing theoretical results. 

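There is a well-known connection between the solution of the OCSVM with translation-invariant kernels and the center of the minimum enclosing sphere (MES) (Tax and Duin 1999, 2004). Intuitively, this is not the case for the OCSMM, even when the kernel $k$ is translation-invariant, as illustrated in Figure 2b. Fortunately, the connection between the OCSMM and the MES can be made precise by applying the spherical normalization

$$\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}} \longmapsto \frac{\langle \mu_{\mathbb{P}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}}{\sqrt{\langle \mu_{\mathbb{P}}, \mu_{\mathbb{P}} \rangle_{\mathcal{H}} \langle \mu_{\mathbb{Q}}, \mu_{\mathbb{Q}} \rangle_{\mathcal{H}}}}. \tag{8}$$

After the normalization, $\|\mu_{\mathbb{P}}\|_{\mathcal{H}} = 1$ for all $\mathbb{P} \in \mathfrak{P}_{\mathcal{X}}$. That is, all mean embeddings lie on the unit sphere in the feature space. Consequently, the OCSMM and the MES are equivalent after the normalization.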
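In terms of a precomputed Gram matrix over the mean embeddings, the spherical normalization amounts to dividing each entry by the square roots of the two corresponding diagonal entries. A minimal sketch follows, with the function name `spherical_normalize` chosen here for illustration and the Gram matrix given as a list of lists with strictly positive diagonal.

```python
import math

def spherical_normalize(K):
    """Rescale a Gram matrix so that every embedding has unit norm:
    K_ij / sqrt(K_ii * K_jj)."""
    norms = [math.sqrt(K[i][i]) for i in range(len(K))]
    return [[K[i][j] / (norms[i] * norms[j]) for j in range(len(K))]
            for i in range(len(K))]
```

After this step every diagonal entry equals one, which is the unit-sphere property used above.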

Given the equivalence between the OCSMM and the MES, it is natural to ask whether the spherical normalization (8) preserves the injectivity of the Hilbert space embedding. In other words, is there an information loss after the normalization? The following theorem answers this question for kernels $k$ that satisfy some reasonable assumptions.

Theorem 3. Assume that $k$ is characteristic and the samples are linearly independent in the feature space $\mathcal{H}$. Then, the spherical normalization preserves the injectivity of the mapping $\mu : \mathfrak{P}_{\mathcal{X}} \to \mathcal{H}$.

Proof. Let us assume the normalization does not preserve the injectivity of the mapping. Thus, there exist two distinct probability distributions $\mathbb{P}$ and $\mathbb{Q}$ for which

$$\int k(x, \cdot)\, d\mathbb{P}(x) = \int k(x, \cdot)\, d\mathbb{Q}(x)$$

$$\int k(x, \cdot)\, d(\mathbb{P} - \mathbb{Q})(x) = 0.$$

As $\mathbb{P} \ne \mathbb{Q}$, the last equality holds if and only if there exists $x \in \mathcal{X}$ for which $k(x, \cdot)$ are linearly dependent, which contradicts the assumption. Consequently, the spherical normalization must preserve the injectivity of the mapping. ■

[Figure 2 panels: (a) feature map and mean map; (b) minimum enclosing sphere; (c) spherical normalization]

Figure 2: (a) The two-dimensional representation of the RKHS of Gaussian RBF kernels. Since the kernels depend only on $x - x'$, $k(x, x)$ is constant. Therefore, all feature maps $\Phi(x)$ (black dots) lie on a sphere in feature space. Hence, for any probability distribution $\mathbb{P}$, its mean embedding $\mu_{\mathbb{P}}$ always lies in the convex hull of the feature maps, which in this case forms a segment of the sphere. (b) In general, the solution of the OCSMM is different from the minimum enclosing sphere. (c) Three-dimensional sphere in the feature space. For the Gaussian RBF kernel, the kernel mean embeddings of all distributions always lie inside the segment of the sphere. In addition, the angle between any pair of mean embeddings is always greater than zero. Consequently, the mean embeddings can be scaled, e.g., to lie on the sphere, and the map is still injective.

The Gaussian RBF kernel satisfies the assumption given in Theorem 3, as the kernel matrix will be full-rank and thereby the samples are linearly independent in the feature space. Figure 2c depicts the effect of the spherical normalization.

It is important to note that the spherical normalization does not necessarily improve the performance of the OCSMM. It only ensures that all the information about the distributions is preserved.



4.2 OCSMM and Density Estimation

In this section we make a connection between the OCSMM and kernel density estimation (KDE). First, we give a definition of the KDE. Let $x_1, x_2, \ldots, x_n$ be an i.i.d. sample from some distribution $F$ with unknown density $f$. The KDE of $f$ is defined as

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - x_i}{h}\right). \tag{9}$$

For $\hat{f}$ to be a density, we require that the kernel satisfies $k(\cdot, \cdot) \ge 0$ and $\int k(x, \cdot)\, dx = 1$, which includes, for example, the Gaussian kernel, the multivariate Student kernel, and the Laplacian kernel.

When $\nu = 1$, it is well known that, under some technical assumptions, the OCSVM corresponds exactly to the KDE (Scholkopf et al. 2001). That is, the solution $w$ of (2) can be written as a uniform sum over training samples similar to (9). Moreover, setting $\nu < 1$ yields a sparse representation where the summand consists of only support vectors of the OCSVM.

Interestingly, we can make a similar correspondence between the KDE and the OCSMM. It follows from Lemma 4 of Muandet et al. (2012) that for certain classes of training probability distributions, the OCSMM on these distributions corresponds to the OCSVM on some training samples equipped with an appropriate kernel function. To understand this connection, consider the OCSMM with the Gaussian RBF kernel $k_\sigma$ and isotropic Gaussian distributions $\mathcal{N}(m_1; \sigma_1^2), \mathcal{N}(m_2; \sigma_2^2), \ldots, \mathcal{N}(m_n; \sigma_n^2)$. (We adopt Gaussian distributions here for the sake of simplicity. A more general statement for non-Gaussian distributions follows straightforwardly.) We analyze this scenario under two conditions:

(C1) Identical bandwidth. If $\sigma_i = \sigma_j$ for all $1 \le i, j \le n$, the OCSMM is equivalent to the OCSVM on the training samples $m_1, m_2, \ldots, m_n$ with Gaussian RBF kernel $k_{\sigma^2 + 2\sigma_i^2}$ (cf. the kernel (7)). Hence, the OCSMM corresponds to the OCSVM on the means of the distributions with a kernel of larger bandwidth.

(C2) Variable bandwidth. Similarly, if $\sigma_i \ne \sigma_j$ for some $1 \le i, j \le n$, the OCSMM is equivalent to the OCSVM on the training samples $m_1, m_2, \ldots, m_n$ with Gaussian RBF kernel $k_{\sigma^2 + \sigma_i^2 + \sigma_j^2}$. Note that the kernel bandwidth may be different at each training sample. Thus, the OCSMM in this case corresponds to the OCSVM with variable bandwidth parameters.



On the one hand, the above scenario allows the OCSVM to cope with noisy or uncertain inputs, leading to a more robust point anomaly detection algorithm. That is, we can treat the means as the measurements and the covariances as the measurement uncertainties (cf. Section 5.2). On the other hand, one can also interpret the OCSMM when $\nu = 1$ as a generalization of the traditional KDE, where we have a data-dependent bandwidth at each data point. This type of KDE is known in statistics as variable kernel density estimators (VKDEs) (Abramson 1982, Breiman et al. 1977, Terrell and Scott 1992). For $\nu < 1$, the OCSMM gives a sparse representation of the VKDE.

Formally, the VKDE is characterized by (9) with an adaptive bandwidth $h(x_i)$. For example, the bandwidth is adapted to be larger where the data are less dense, with the aim of reducing the bias. There are basically two different views of VKDE. The first is known as a balloon estimator (Terrell and Scott 1992). Essentially, its bandwidth may depend only on the point at which the estimate is taken, i.e., the bandwidth in (9) may be written as $h(y)$. The second type of VKDE is a sample smoothing estimator (Terrell and Scott 1992). As opposed to the balloon estimator, it is a mixture of individually scaled kernels centered at each observation, i.e., the bandwidth is $h(x_i)$. The advantage of the balloon estimator is that it has a straightforward asymptotic analysis, but the final estimator may not be a density. The sample smoothing estimator is a density if $k$ is a density, but exhibits non-locality.
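The difference between the two views is perhaps easiest to see in code. The sketch below is a one-dimensional illustration with a Gaussian kernel; the function names and the bandwidth rule `h_of` passed in by the caller are assumptions for illustration only. The balloon estimator evaluates the bandwidth at the query point, whereas the sample smoothing estimator evaluates it at each observation.

```python
import math

def gauss(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(y, xs, h):
    """Fixed-bandwidth KDE evaluated at y."""
    return sum(gauss((y - x) / h) for x in xs) / (len(xs) * h)

def balloon_kde(y, xs, h_of):
    """Balloon estimator: one bandwidth per query point, h = h_of(y)."""
    return kde(y, xs, h_of(y))

def sample_smoothing_kde(y, xs, h_of):
    """Sample smoothing estimator: one bandwidth per observation, h = h_of(x_i)."""
    return sum(gauss((y - x) / h_of(x)) / h_of(x) for x in xs) / len(xs)
```

With a constant bandwidth rule both variants coincide with the fixed-bandwidth KDE. With a varying rule, the sample smoothing estimator is still a mixture of densities and so integrates to one, while the balloon estimator in general does not.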

Both types of VKDEs may be seen from the OCSMM point of view. Firstly, under the condition (C1), the balloon estimator can be recovered by considering a different test distribution $\mathbb{P}_t = \mathcal{N}(m_t; \sigma_t)$. As $\sigma_t \to 0$, one obtains the standard KDE on $m_t$. Similarly, the OCSMM under the condition (C2) with $\mathbb{P}_t = \delta_{m_t}$ gives the sample smoothing estimator. Interestingly, the OCSMM under the condition (C2) with $\mathbb{P}_t = \mathcal{N}(m_t; \sigma_t)$ results in a combination of these two types of VKDEs.

In summary, we show that many variants of KDE can be seen as solutions to the regularization functional (2), thereby providing an insight into the connection between the large-margin approach and kernel density estimation.

5 Experiments 

We first illustrate a fundamental difference between point and group anomaly detection problems. Then, we demonstrate an advantage of the OCSMM on uncertain data when the noise is observed explicitly. Lastly, we compare the OCSMM with existing group anomaly detection techniques, namely, K-nearest neighbor (KNN) based anomaly detection (Zhao and Saligrama 2009) with NP-L2 divergence and NP-Renyi divergence (Poczos et al. 2011), and the Multinomial Genre Model (MGM) (Xiong et al. 2011b), on the Sloan Digital Sky Survey (SDSS) dataset and the High Energy Particle Physics dataset.

Model Selection and Setup. One of the long-standing problems of one-class algorithms is model selection. Since no labeled data is available during training, we cannot perform cross validation. To encourage a fair comparison of different algorithms in our experiments, we will try out different parameter settings and report the best performance of each algorithm. We believe this simple approach should serve its purpose of reflecting the relative performance of different algorithms. We will employ the Gaussian RBF kernel (6) throughout the experiments. For the OCSVM and the OCSMM, the bandwidth parameter $\sigma^2$ is fixed at $\mathrm{median}\{\|x_k^{(i)} - x_l^{(j)}\|^2\}$ for all $i, j, k, l$, where $x_k^{(i)}$ denotes the $k$-th data point in the $i$-th group, and we consider $\nu = (0.1, 0.2, \ldots, 0.9)$. The OCSVM treats group means as training samples. For synthetic experiments with the OCSMM, we use the empirical kernel (5), whereas the non-linear kernel $K(\mathbb{P}_i, \mathbb{P}_j) = \exp(-\|\mu_{\mathbb{P}_i} - \mu_{\mathbb{P}_j}\|_{\mathcal{H}}^2 / 2\gamma^2)$ will be used for real data, where we set $\gamma = \sigma$. Our experiments suggest that these choices of parameters usually work well in practice. For KNN-L2 and KNN-Renyi ($\alpha = 0.99$), we consider 3, 5, 7, 9, and 11 nearest neighbors. For MGM, we follow the same experimental setup as in Xiong et al. (2011b).
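The two parameter choices described above can be sketched as follows. This is an illustrative reading rather than the authors' code: the median is taken over squared distances between distinct pairs of pooled points (excluding a point paired with itself is an assumption made here), and the non-linear kernel is computed from a precomputed Gram matrix of mean embeddings using the identity that the squared RKHS distance equals K_ii - 2 K_ij + K_jj. The function names are hypothetical.

```python
import math
from statistics import median

def median_heuristic_sq(groups):
    """Median of squared Euclidean distances over all distinct pairs of
    points pooled across groups (used as the value of sigma^2)."""
    points = [x for group in groups for x in group]
    sq_dists = [sum((a - b) ** 2 for a, b in zip(p, q))
                for idx, p in enumerate(points) for q in points[idx + 1:]]
    return median(sq_dists)

def rbf_on_embeddings(K, gamma):
    """Gaussian RBF kernel on mean embeddings, built from the Gram matrix K
    of inner products between embeddings."""
    l = len(K)
    return [[math.exp(-(K[i][i] - 2.0 * K[i][j] + K[j][j]) / (2.0 * gamma ** 2))
             for j in range(l)] for i in range(l)]
```

Pooling all pairs is quadratic in the total number of points, so on larger datasets one would typically compute the median on a random subsample.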



5.1 Synthetic Data 

To illustrate the difference between point anomalies and group anomalies, we represent each group of data points by a 2-dimensional Gaussian distribution. We generate 20 normal groups with the covariance $\Sigma$ = [0.01, 0.008; 0.008, 0.01]. The means of these groups are drawn uniformly from [0, 1]. Then, we generate 2 anomalous groups of Gaussian distributions whose covariances are rotated by 60 degrees from the covariance $\Sigma$. Furthermore, we perturb one of the normal groups to make it relatively far from the rest of the dataset to introduce an additional degree of anomaly (cf. Figure 3a). Lastly, we generate 100 samples from each of these distributions to form the training set.
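A data set of this kind can be generated along the following lines. The sketch is a hedged reconstruction of the described setup, not the script used for the paper: the helper names are hypothetical, the rotation is applied as R Σ Rᵀ with a standard 2D rotation matrix, and sampling uses a hand-written 2x2 Cholesky factor so that only the standard library is needed.

```python
import math
import random

def rotate_cov(S, degrees):
    """Rotate a 2x2 covariance matrix: R S R^T for a rotation by `degrees`."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    R = [[c, -s], [s, c]]
    RS = [[sum(R[i][k] * S[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(RS[i][k] * R[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sample_gaussian_2d(mean, S, n, rng):
    """Draw n points from N(mean, S) using the Cholesky factor of S."""
    a = math.sqrt(S[0][0])
    b = S[1][0] / a
    c = math.sqrt(S[1][1] - b * b)
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + a * z1, mean[1] + b * z1 + c * z2))
    return points

# Example: one normal group and one anomalous (rotated-covariance) group.
rng = random.Random(0)
S = [[0.01, 0.008], [0.008, 0.01]]
normal_group = sample_gaussian_2d((rng.random(), rng.random()), S, 100, rng)
anomalous_group = sample_gaussian_2d((rng.random(), rng.random()),
                                     rotate_cov(S, 60), 100, rng)
```

Because a rotation leaves the eigenvalues of the covariance unchanged, the anomalous groups have the same spread as the normal ones and differ only in orientation, which is why their group means alone carry no signal.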

For the OCSVM, we represent each group by its empirical average. Since the expected proportion of outliers in the dataset is approximately 10%, we use $\nu = 0.1$ accordingly for both the OCSVM and the OCSMM. Figure 3a depicts the result, which demonstrates that the OCSMM can detect anomalous aggregate patterns undetected by the OCSVM.



[Figure 3 panels: (a) OCSVM vs OCSMM (One-Class Support Vector Machine, One-Class Support Measure Machine); (b) The results of the OCSMM on the mixture of Gaussian dataset]

Figure 3: (a) The results of group anomaly detection on synthetic data obtained from the OCSVM and the OCSMM. Blue dashed ovals represent the normal groups, whereas red ovals represent the detected anomalous groups. The OCSVM is only able to detect the anomalous groups that are spatially far from the rest of the dataset, whereas the OCSMM also takes into account other higher-order statistics and therefore can also detect anomalous groups which possess distinctive properties. (b) The results of the OCSMM on the synthetic mixture of Gaussian data. The shaded boxes represent the anomalous groups that have a different mixing proportion from the rest of the dataset. The OCSMM is able to detect the anomalous groups although they look reasonably normal and cannot be easily distinguished from other groups in the data set by inspection alone.



Then, we conduct an experiment similar to that in Xiong et al. (2011b). That is, the groups are represented as a mixture of four 2-dimensional Gaussian distributions. The means of the mixture components are [−1, −1], [1, −1], [0, 1], [1, 1] and the covariances are all $\Sigma = 0.15 \times I_2$, where $I_2$ denotes the 2D identity matrix. Then, we design two types of normal groups, which are specified by two mixing proportions [0.22, 0.64, 0.03, 0.11] and [0.22, 0.03, 0.64, 0.11], respectively. To generate a normal group, we first decide with probability [0.48, 0.52] which mixing proportion will be used. Then, the data points are generated from the mixture of Gaussians using the specified mixing proportion. The mixing proportion of the anomalous group is [0.61, 0.1, 0.06, 0.23].

We generated 47 normal groups with $n_i \sim \mathrm{Poisson}(300)$ instances in each group. Note that the individual samples in each group are perfectly normal compared to other samples. To test the performance of our technique, we inject group anomalies, where the individual points are normal, but together as a group they look anomalous. In each anomalous group the individual points are samples from one of the K = 4 normal topics, but the mixing proportion is different from both of the normal mixing proportions. We inject 3 anomalous groups into the data set. The OCSMM is trained using the same setting as in the previous experiment. The results are depicted in Figure 3b.

5.2 Noisy Data

As discussed at the end of Section 3.2, the OCSMM may be adopted to learn from data points whose uncertainties are observed explicitly. To illustrate this claim, we generate samples from the unit circle using $x = \cos\theta + \varepsilon$ and $y = \sin\theta + \varepsilon$, where $\theta$ is drawn from $(-\pi, \pi]$ and $\varepsilon$ is zero-mean isotropic Gaussian noise $\mathcal{N}(0, 0.05)$. A different point-wise Gaussian noise $\mathcal{N}(0, \omega_i)$, where $\omega_i \in (0.2, 0.3)$, is further added to each point to simulate random measurement corruption. In this experiment, we assume that $\omega_i$ is available during training. This situation is often encountered in applications such as astronomy and computational biology. Both the OCSVM and the OCSMM are trained on the corrupted data. As opposed to the OCSVM, which considers only the observed data points, the OCSMM also uses $\omega_i$ for every point via the kernel (7). Then, we consider slightly more complicated data generated by $x = r \cdot \cos(\theta)$ and $y = r \cdot \sin(\theta)$, where $r = \sin(4\theta) + 2$ and $\theta \in (0, 2\pi]$. The data used in both examples are illustrated in Figure 4.

As illustrated by Figure 4, the density function estimated by the OCSMM is less susceptible to the additional corruption than that estimated by the OCSVM, and tends to estimate the true density more accurately. This is not surprising because we also take into account additional information about the uncertainty. However, this experiment suggests that when dealing with uncertain data, it might be beneficial to also estimate the uncertainty, as commonly performed in astronomy, and incorporate it into the model. This scenario has not been fully investigated in the AI and machine learning communities. Our framework provides one possible way to deal with such a scenario.

Figure 4: The density functions estimated by the OCSVM and the OCSMM using the corrupted data.

Figure 5: The average precision (AP) and area under the ROC curve (AUC) of different group anomaly detection algorithms on the SDSS dataset.

5.3 Sloan Digital Sky Survey 

The Sloan Digital Sky Survey (SDSS) consists of a series of massive spectroscopic surveys of the distant universe, the Milky Way galaxy, and extrasolar planetary systems. The SDSS datasets contain images and spectra of more than 930,000 galaxies and more than 120,000 quasars.

In this experiment, we are interested in identifying anomalous groups of galaxies, as previously studied in Poczos et al. (2011) and Xiong et al. (2011a,b). To replicate the experiments conducted in Xiong et al. (2011b), we use the same dataset, which consists of 505 spatial clusters of galaxies, each of which contains about 10-15 galaxies. The data were preprocessed by PCA to reduce the 1000-dimensional features to 4-dimensional vectors.

To evaluate how well different algorithms detect
group anomalies, we consider artificial random
injections. Each anomalous group is constructed by
randomly selecting galaxies. There are 50 anomalous
groups of galaxies in total. Note that although these
groups contain usual galaxies, their aggregations are
anomalous due to the way the groups are constructed.
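
The random injection procedure described above can be sketched as
follows. This is an illustration under stated assumptions, not the
authors' protocol: the helper name `inject_random_groups`, sampling
without replacement from the pool of all galaxies, and reusing the
10-15 size range of the normal clusters are all hypothetical choices.

```python
import random


def inject_random_groups(groups, n_anomalous, size_range=(10, 15), seed=0):
    """Build anomalous groups by drawing members at random from all groups."""
    rng = random.Random(seed)
    # Pool every point, ignoring the spatial cluster it came from.
    pool = [point for group in groups for point in group]
    anomalous = []
    for _ in range(n_anomalous):
        size = rng.randint(*size_range)  # both endpoints are inclusive
        anomalous.append(rng.sample(pool, size))
    return anomalous


# Toy stand-in for the real data: 20 clusters of 12 four-dimensional points.
toy_rng = random.Random(1)
normal_groups = [
    [tuple(toy_rng.gauss(c, 0.1) for _ in range(4)) for _ in range(12)]
    for c in range(20)
]
anomalous_groups = inject_random_groups(normal_groups, n_anomalous=5)
```

Each injected group mixes members of unrelated clusters, so every
individual point is ordinary while the group as a whole is not.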

The average precision (AP) and area under the ROC
curve (AUC) from 10 random repetitions are shown in
Figure 5. Based on the average precision, KNN-L2,
MGM, and OCSMM achieve similar results on this
dataset, and KNN-Renyi outperforms all other
algorithms. On the other hand, the OCSMM and
KNN-Renyi achieve the highest AUC scores on this
dataset. Moreover, it is clear that point anomaly
detection using the OCSVM fails to detect group
anomalies.
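
For reference, the two evaluation measures can be computed from
group-level anomaly scores as sketched below. The text does not say
how the reported numbers were computed, so these are simply the
standard definitions; the function names and the toy labels and
scores are assumptions, with label 1 marking an anomalous group and
a higher score meaning "more anomalous".

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # A tie between an anomalous and a normal score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each anomalous group."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(roc_auc(labels, scores), average_precision(labels, scores))
```

A detector that ranks every anomalous group above every normal one
attains 1.0 on both measures.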

5.4 High Energy Particle Physics 

In this section, we demonstrate our group anomaly 
detection algorithm in high energy particle physics, 
which is largely the study of fundamental parti- 
cles, e.g., neutrinos, and their interactions. Essen- 
tially, all particles and their dynamics can be de- 
scribed by a quantum field theory called the Stan- 
dard Model. Hence, given massive datasets from high- 
energy physics experiments, one is interested in discov- 
ering deviations from known Standard Model physics. 

Searching for the Higgs boson, for example, has
recently received much attention in the particle
physics and machine learning communities (see, e.g.,
Bhat (2011), Vatanen et al. (2012), and references
therein). New physical phenomena usually manifest
themselves as tiny excesses of certain types of
collision events among a vast background of known
physics in particle detectors.

Anomalies occur as a cluster among the background
data. The background data distribution contaminated
by these anomalies will therefore be different from the
true background distribution. It is very difficult to
detect this difference in general because the
contamination can be very small. In this experiment,
we consider conditions similar to those in Vatanen et
al. (2012) and generate data using standard HEP Monte
Carlo generators such as PYTHIA². In particular, we
consider Monte Carlo simulated events in which the
Higgs is produced in association with the W boson
and decays into two bottom quarks.

¹ See http://www.sdss.org for details of the surveys.

² http://home.thep.lu.se/~torbjorn/Pythia.html

Figure 6: The ROC of different group anomaly detection
algorithms on the Higgs boson datasets with various
Higgs masses $m_H$. The associated AUC scores for the
different settings, sorted in the order in which they
appear in the figure, are
(0.6835, 0.6655, 0.6350, 0.5125, 0.7085),
(0.5645, 0.6783, 0.5860, 0.5263, 0.7305),
(0.8190, 0.7925, 0.7630, 0.4958, 0.7950), and
(0.6713, 0.6027, 0.6165, 0.5862, 0.7200).

The data vector consists of 5 variables
$(p_x, p_y, p_z, e, m)$ corresponding to different
characteristics of the topology of a collision event.
The variables $p_x, p_y, p_z, e$ represent the
momentum four-vector in units of GeV with $c = 1$.
The variable $m$ is the particle mass in the same
unit. The signal looks slightly different for
different Higgs masses $m_H$, which is an unknown
free parameter in the Standard Model. In this
experiment, we consider $m_H$ = 100, 115, 135, and
150 GeV. We generate 120 groups of collision events,
100 of which contain only background signals, whereas
the rest also contain the Higgs boson collision
events. For each group, the number of observable
particles ranges from 200 to 500. The goal is to
detect the anomalous groups of signals which might
contain the Higgs boson without prior knowledge of
$m_H$.
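
To make the five-variable representation concrete, the sketch below
relates the mass to the four-vector through the standard
energy-momentum relation $m^2 = e^2 - (p_x^2 + p_y^2 + p_z^2)$, which
holds for an on-shell particle in units with $c = 1$. Whether the mass
was computed this way or read directly from the generator output is
not stated in the text, so this is an assumption, and the function
names are hypothetical.

```python
import math


def invariant_mass(px, py, pz, e):
    """Mass implied by a four-vector in natural units (c = 1), in GeV."""
    m_squared = e ** 2 - (px ** 2 + py ** 2 + pz ** 2)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m_squared, 0.0))


def particle_vector(px, py, pz, e):
    """Assemble the five-variable representation (px, py, pz, e, m)."""
    return (px, py, pz, e, invariant_mass(px, py, pz, e))


print(particle_vector(3.0, 0.0, 4.0, 13.0))  # mass component is 12.0
```

A group in this experiment is then a collection of 200 to 500 such
vectors, one per observable particle.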

Figure 6 depicts the ROC curves of different group
anomaly detection algorithms. The OCSMM and KNN-based
group anomaly detection algorithms tend to achieve
competitive performance and outperform the MGM
algorithm. Moreover, it is clear that the traditional
point anomaly detection algorithm fails to detect
high-level anomalous structures.

6 Conclusions and Discussions 

To conclude, we propose a simple and efficient
algorithm for detecting group anomalies called the
one-class support measure machine (OCSMM). To handle
aggregate behaviors of data points, groups are
represented as probability distributions, which
account for higher-order information arising from
those behaviors. The set of distributions is
represented as mean functions in the RKHS via the
kernel mean embedding. We also extend the relationship
between the OCSVM and the KDE to the OCSMM in the
context of variable kernel density estimation,
bridging the gap between large-margin approaches and
kernel density estimation. We demonstrate the proposed
algorithm on both synthetic and real-world datasets,
where it achieves competitive results compared to
existing group anomaly detection techniques.
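
The representation of groups as mean functions in the RKHS can be
illustrated with the empirical inner product between two embeddings,
which is the average of the base kernel over all cross pairs of
points. The sketch below is a minimal illustration under assumptions:
a Gaussian base kernel with bandwidth parameter `gamma`, the linear
kernel on the embeddings, and hypothetical function names. It is not
the authors' implementation.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Base kernel k(x, y) = exp(-gamma * ||x - y||^2) on the input space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_embedding_kernel(group_p, group_q, gamma=1.0):
    """Empirical <mu_P, mu_Q>: the average of k(x, y) over all cross pairs."""
    total = sum(gaussian_kernel(x, y, gamma) for x in group_p for y in group_q)
    return total / (len(group_p) * len(group_q))


def group_gram_matrix(groups, gamma=1.0):
    """Symmetric matrix of kernel values between every pair of groups."""
    n = len(groups)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = mean_embedding_kernel(groups[i], groups[j], gamma)
            gram[i][j] = gram[j][i] = value
    return gram


groups = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0)], [(5.0, 5.0)]]
gram = group_gram_matrix(groups)
```

Such a Gram matrix over groups could then be passed to any one-class
SVM solver that accepts a precomputed kernel. When every group holds
a single point, the matrix reduces to the usual point-wise kernel
matrix.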

It is vital to note the differences between the OCSMM
and hierarchical probabilistic models such as MGM
and FGM. Firstly, the probabilistic models assume
that data are generated according to some parametric
distributions, i.e., mixtures of Gaussians, whereas
the OCSMM is nonparametric in the sense that no
assumption is made about the distributions. It is
therefore applicable to a wider range of applications.
Secondly, the probabilistic models follow a bottom-up
approach. That is, detecting group-based anomalies
requires point-based anomaly detection. Thus, the
performance also depends on how well anomalous points
can be detected. Furthermore, this approach is
computationally expensive and may not be suitable for
large-scale datasets. On the other hand, the OCSMM
adopts a top-down approach by detecting the
group-based anomalies directly. If one is interested
in finding anomalous points, this can be done
subsequently in a group-wise manner. As a result, the
top-down approach is generally less computationally
expensive and can be used efficiently for online
applications and large-scale datasets.